Abstract

Recovering a low-rank tensor from incomplete information is a recurring problem in signal
processing and machine learning. The most popular convex relaxation of this problem minimizes
the sum of the nuclear norms of the unfoldings of the tensor. We show that this approach
can be substantially suboptimal: reliably recovering a $K$-way tensor of length $n$ and Tucker
rank $r$ from Gaussian measurements requires $\Omega(rn^{K-1})$ observations. In contrast, a certain
(intractable) nonconvex formulation needs only $O(r^K + nrK)$ observations. We introduce a very
simple, new convex relaxation, which partially bridges this gap. Our new formulation succeeds
with $O(r^{\lfloor K/2 \rfloor} n^{\lceil K/2 \rceil})$ observations. While these results pertain to Gaussian measurements,
simulations strongly suggest that the new norm also outperforms the sum of nuclear norms for
tensor completion from a random subset of entries.

Our lower bound for the sum-of-nuclear-norms model follows from a new result on recovering
signals with multiple sparse structures (e.g. sparse, low rank), which perhaps surprisingly
demonstrates the significant suboptimality of the commonly used recovery approach via minimizing
the sum of individual sparsity inducing norms (e.g. $\ell_1$, nuclear norm). Our new formulation
for low-rank tensor recovery, however, opens the possibility of reducing the sample complexity
by exploiting several structures jointly.


1  Introduction 


Tensors arise naturally in problems where the goal is to estimate a multi-dimensional object whose
entries are indexed by several continuous or discrete variables. For example, a video is indexed
by two spatial variables and one temporal variable; a hyperspectral datacube is indexed by two
spatial variables and a frequency/wavelength variable. While tensors often reside in extremely
high-dimensional data spaces, in many applications, the tensor of interest is low-rank, or approximately
so [KB09], and hence has much lower-dimensional structure. The general problem of estimating a
low-rank tensor has applications in many different areas, both theoretical and applied: e.g.,
estimating latent variable graphical models [AGH+12], classifying audio [MSS06], mining text [GG12],
processing radar signals [DN10], to name a few.

For most of the paper, we consider the problem of recovering a $K$-way tensor
$\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_K}$ from linear measurements
$z = \mathcal{G}[\mathcal{X}] \in \mathbb{R}^m$. Typically, $m \ll N = \prod_{i=1}^{K} n_i$, and so the problem of
recovering $\mathcal{X}$ from $z$ is ill-posed. In the past few years, tremendous progress has been made in
understanding how to exploit structural assumptions such as sparsity for vectors [GRT] or
low-rankness for matrices [RFP10] to develop computationally tractable methods for tackling ill-posed
inverse problems. In many situations, convex optimization can estimate a structured object from
near-minimal sets of observations [NRWY12], [CRPW12], [ALMT13]. For example, an $n \times n$ matrix
of rank $r$ can, with high probability, be exactly recovered from $Cnr$ generic linear measurements,
by minimizing the nuclear norm $\|X\|_* = \sum_i \sigma_i(X)$. Since a generic rank $r$ matrix has $r(2n - r)$
degrees of freedom, this is nearly optimal.

In contrast, the correct generalization of these results to low-rank tensors is not obvious. The
numerical algebra of tensors is fraught with hardness results [HL09]. For example, even computing
a tensor's (CP) rank,

$$\mathrm{rank}_{cp}(\mathcal{X}) = \min \Big\{ r \;\Big|\; \mathcal{X} = \sum_{i=1}^{r} a_i^{(1)} \circ a_i^{(2)} \circ \cdots \circ a_i^{(K)} \Big\} \tag{1.1}$$

is NP-hard in general. The nuclear norm of a tensor is also intractable, and so we cannot simply
follow the formula that has worked for vectors and matrices.

With an eye towards numerical computation, many researchers have studied how to estimate
or recover tensors of small Tucker rank [Tuc66]. The Tucker rank of a $K$-way tensor $\mathcal{X}$ is a
$K$-dimensional vector whose $i$-th entry is the (matrix) rank of the mode-$i$ unfolding $\mathcal{X}_{(i)}$ of $\mathcal{X}$:

$$\mathrm{rank}_{tc}(\mathcal{X}) := \big( \mathrm{rank}(\mathcal{X}_{(1)}),\ \mathrm{rank}(\mathcal{X}_{(2)}),\ \ldots,\ \mathrm{rank}(\mathcal{X}_{(K)}) \big). \tag{1.2}$$

Here, the matrix $\mathcal{X}_{(i)} \in \mathbb{R}^{n_i \times \prod_{j \neq i} n_j}$ is obtained by concatenating all the mode-$i$ fibers of $\mathcal{X}$ as
column vectors. Each mode-$i$ fiber is an $n_i$-dimensional vector obtained by fixing every index of $\mathcal{X}$
but the $i$-th one. The Tucker rank of $\mathcal{X}$ can be computed efficiently using the (matrix) singular
value decomposition. For this reason, we focus on tensors of low Tucker rank. However, we will see
that our proposed regularization strategy also automatically adapts to recover tensors of low CP
rank, with some reduction in the required number of measurements.
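As an illustration of definition (1.2), the following NumPy sketch unfolds a tensor along each mode and reads the Tucker rank off the singular values. The helper names `unfold` and `tucker_rank`, the relative tolerance, and the test tensor are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` unfolding (0-based): rows index the chosen mode,
    columns enumerate the remaining modes in column-major order."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order="F")

def tucker_rank(X, tol=1e-10):
    """Matrix rank of every unfolding, computed from singular values."""
    ranks = []
    for mode in range(X.ndim):
        s = np.linalg.svd(unfold(X, mode), compute_uv=False)
        ranks.append(int(np.sum(s > tol * s[0])))  # relative threshold (assumed)
    return tuple(ranks)

rng = np.random.default_rng(0)
# Hypothetical test tensor: 6 x 7 x 8, built from a random 2 x 3 x 2 core.
core = rng.standard_normal((2, 3, 2))
U = [rng.standard_normal((n, r)) for n, r in zip((6, 7, 8), core.shape)]
X = np.einsum("abc,ia,jb,kc->ijk", core, *U)
print(tucker_rank(X))
```

Only ranks are needed here, so the ordering of the columns inside each unfolding is immaterial; the column-major choice simply matches the usual fiber-concatenation convention.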


The definition (1.2) suggests a very natural, tractable convex approach to recovering low-rank
tensors: seek the $\mathcal{X}$ that minimizes $\sum_i \lambda_i \|\mathcal{X}_{(i)}\|_*$ out of all $\mathcal{X}$ satisfying $\mathcal{G}[\mathcal{X}] = z$. We will
refer to this as the sum-of-nuclear-norms (SNN) model. Originally proposed in [LMWY09], this
approach has been widely studied [GRY11], [SDS10], [THK10], [TSHK11], [STDLS13] and applied to
various datasets in imaging [SVdPDMS11], [SHKM13], [KS13], [LL10], [LYZY10].
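To make the SNN objective concrete, here is a minimal sketch that evaluates $\sum_i \lambda_i \|\mathcal{X}_{(i)}\|_*$ for a given tensor. It only evaluates the objective; it is not a solver for the constrained problem. The function name `sum_of_nuclear_norms` and the default of unit weights are assumptions for this example.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` unfolding (0-based) as an n_mode x (product of other dims) matrix."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order="F")

def sum_of_nuclear_norms(X, weights=None):
    """Weighted sum of the nuclear norms of all unfoldings of X."""
    if weights is None:
        weights = np.ones(X.ndim)  # assumed default: every lambda_i equals 1
    return float(sum(w * np.linalg.norm(unfold(X, i), ord="nuc")
                     for i, w in enumerate(weights)))

# A rank-one tensor built from unit vectors: every unfolding has one singular value, 1.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.6, 0.8])
c = np.ones(4) / 2.0
X = np.einsum("i,j,k->ijk", a, b, c)
value = sum_of_nuclear_norms(X)
print(value)
```

For this rank-one example each of the three unfoldings contributes a nuclear norm of one, so the objective equals the number of modes up to rounding.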

Perhaps surprisingly, we show that this natural approach can be substantially suboptimal, and
introduce a simple new convex regularizer with provably better performance. For ease of stating
results, suppose that $n_1 = \cdots = n_K = n$, and $\mathrm{rank}_{tc}(\mathcal{X}) \preceq (r, r, \ldots, r)$. Let $\mathfrak{T}_r$ denote the
set of all such tensors. We will consider the problem of estimating an element $\mathcal{X}_0$ of $\mathfrak{T}_r$ from
Gaussian measurements $\mathcal{G}$ (i.e., $z_i = \langle \mathcal{G}_i, \mathcal{X} \rangle$, where $\mathcal{G}_i$ has i.i.d. standard normal entries). To
describe a generic tensor in $\mathfrak{T}_r$, we need at most $r^K + rnK$ parameters. Section 2 shows that a
certain nonconvex strategy can recover all $\mathcal{X} \in \mathfrak{T}_r$ exactly when $m > (2r)^K + 2nrK$. In contrast,
the best known theoretical guarantee for SNN minimization, due to Tomioka et al. [TSHK11],
shows that $\mathcal{X}_0 \in \mathfrak{T}_r$ can be recovered (or accurately estimated) from Gaussian measurements $\mathcal{G}$,
provided $m = \Omega(rn^{K-1})$. In Section 3 we prove that this number of measurements is also necessary:
accurate recovery is unlikely unless $m = \Omega(rn^{K-1})$. Thus, there is a substantial gap between an
ideal nonconvex approach and the best known tractable surrogate. In Section 4 we introduce a
simple alternative, which we call the square norm model, which reduces the required number of
measurements to $O(r^{\lfloor K/2 \rfloor} n^{\lceil K/2 \rceil})$. For $K > 3$, this improves by a multiplicative factor polynomial
in $n$.
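To get a feel for the size of these gaps, the short sketch below evaluates the four measurement counts quoted above for one hypothetical setting. All unspecified constants are set to one, which is an assumption made purely for illustration, so only the orders of magnitude are meaningful.

```python
def measurement_counts(n, r, K):
    """Counts from the text, with every unspecified constant set to 1 (assumption)."""
    return {
        "degrees_of_freedom": r**K + n * r * K,        # r^K + rnK
        "nonconvex": (2 * r) ** K + 2 * n * r * K,     # (2r)^K + 2nrK
        "snn": r * n ** (K - 1),                       # r n^(K-1)
        "square": r ** (K // 2) * n ** (K - K // 2),   # r^floor(K/2) n^ceil(K/2)
    }

# Hypothetical example: a 4-way tensor of side length 100 and Tucker rank 5.
counts = measurement_counts(n=100, r=5, K=4)
for name, value in counts.items():
    print(f"{name:>20s}: {value}")
```

In this setting the SNN count is twenty times the square-norm count, which in turn is well above the nonconvex baseline.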

Our theoretical results pertain to Gaussian operators $\mathcal{G}$. The motivation for studying Gaussian
measurements is twofold. First, Gaussian measurements may be of interest for compressed sensing
recovery [Don06], either directly as a measurement strategy, or indirectly due to universality
phenomena 1 ) 11)1)1 [BLM12]. Second, the available theoretical tools for Gaussian measurements are
very sharp, allowing us to rigorously investigate the efficacy of various regularization schemes, and
prove both upper and lower bounds on the number of observations required. In simulation, our
qualitative conclusions carry over to more realistic measurement models, such as random subsampling
[LMWY09] (see Section 5). We expect our results to be of interest for a wide range of problems in
tensor completion [LMWY09], robust tensor recovery / decomposition [LYZY10], [GQ12] and sensing.

Our technical methodology draws on, and enriches, the literature on general structured model
recovery. The surprisingly poor behavior of the SNN model is an example of a phenomenon first
discovered by Oymak et al. [OJF+12]: for recovering objects with multiple structures, a combination
of structure-inducing norms is often not significantly more powerful than the best individual
structure-inducing norm. Our lower bound for the SNN model follows from a general result of this
nature, which we prove using the geometric framework of [ALMT13]. Compared to [OJF+12], our
result pertains to a more general family of regularizers, and gives sharper constants. In addition, we
demonstrate the possibility of reducing the number of generic measurements through a new convex
regularizer that exploits several sparse structures jointly.


2  Bounds  for  Non-Convex  Recovery 

In  this  section,  we  introduce  a  non-convex  model  for  tensor  recovery,  and  show  that  it  recovers 
low-rank  tensors  from  near-minimal  numbers  of  measurements.  While  our  nonconvex  formulation  is 
computationally  intractable,  it  gives  a  baseline  for  evaluating  tractable  (convex)  approaches. 

For a tensor of low Tucker rank, the matrix unfolding along each mode is low-rank. Suppose
we observe $\mathcal{G}[\mathcal{X}_0] \in \mathbb{R}^m$. We would like to attempt to recover $\mathcal{X}_0$ by minimizing some combination
of the ranks of the unfoldings, over all tensors $\mathcal{X}$ that are consistent with our observations. This
suggests a vector optimization problem [BV04, Chap. 4.7]:

$$\text{minimize}_{(\text{w.r.t. } \mathbb{R}^K_+)} \ \ \mathrm{rank}_{tc}(\mathcal{X}) \quad \text{subject to} \quad \mathcal{G}[\mathcal{X}] = \mathcal{G}[\mathcal{X}_0]. \tag{2.1}$$

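In vector optimization, a feasible point is called Pareto optimal if no other feasible point dominates
it in every criterion. In a similar vein, we say that (2.1) recovers $\mathcal{X}_0$ if there does not exist any other
tensor $\mathcal{X}$ that is consistent with the observations and has no larger rank along each mode:

Definition 1. We call $\mathcal{X}_0$ recoverable by (2.1) if the set

$$\{ \mathcal{X}' \neq \mathcal{X}_0 \mid \mathcal{G}[\mathcal{X}'] = \mathcal{G}[\mathcal{X}_0],\ \mathrm{rank}_{tc}(\mathcal{X}') \preceq_{\mathbb{R}^K_+} \mathrm{rank}_{tc}(\mathcal{X}_0) \} = \emptyset.$$

This is equivalent to saying that $\mathcal{X}_0$ is the unique optimal solution to the scalar optimization:

$$\text{minimize}_{\mathcal{X}} \ \ \max_i \left\{ \frac{\mathrm{rank}(\mathcal{X}_{(i)})}{\mathrm{rank}(\mathcal{X}_{0(i)})} \right\} \quad \text{subject to} \quad \mathcal{G}[\mathcal{X}] = \mathcal{G}[\mathcal{X}_0]. \tag{2.2}$$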
The problems (2.1)-(2.2) are not tractable. However, they do serve as a baseline for understanding
how many generic measurements are required to recover $\mathcal{X}_0$. The recovery performance of program
(2.1) depends heavily on the properties of $\mathcal{G}$. Suppose (2.1) fails to recover $\mathcal{X}_0 \in \mathfrak{T}_r$. Then there
exists another $\mathcal{X}' \in \mathfrak{T}_r$ such that $\mathcal{G}[\mathcal{X}'] = \mathcal{G}[\mathcal{X}_0]$. So, to guarantee that (2.1) recovers any $\mathcal{X}_0 \in \mathfrak{T}_r$,
a necessary and sufficient condition is that $\mathcal{G}$ is injective on $\mathfrak{T}_r$, which is implied by the condition
$\mathrm{null}(\mathcal{G}) \cap \mathfrak{T}_{2r} = \{0\}$. Consequently, if $\mathrm{null}(\mathcal{G}) \cap \mathfrak{T}_{2r} = \{0\}$, (2.1) will recover any $\mathcal{X}_0 \in \mathfrak{T}_r$. We
expect this to occur when the number of measurements significantly exceeds the number of intrinsic
degrees of freedom of a generic element of $\mathfrak{T}_r$, which is $O(r^K + nrK)$. The following theorem shows
that when $m$ is approximately twice this number, with probability one, $\mathcal{G}$ is injective on $\mathfrak{T}_r$:

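Theorem 1. Whenever $m \ge (2r)^K + 2nrK + 1$, with probability one, $\mathrm{null}(\mathcal{G}) \cap \mathfrak{T}_{2r} = \{0\}$, and
hence (2.1) recovers every $\mathcal{X}_0 \in \mathfrak{T}_r$.

The proof of Theorem 1 follows from a covering argument, which we establish in several steps. Let

$$\mathfrak{S}_{2r} = \{ \mathcal{V} \mid \mathcal{V} \in \mathfrak{T}_{2r},\ \|\mathcal{V}\|_F = 1 \}. \tag{2.3}$$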


The following lemma shows that the required number of measurements can be bounded in terms of
the exponent of the covering number for $\mathfrak{S}_{2r}$, which can be considered as a proxy for dimensionality:

Lemma 1. Suppose that the covering number for $\mathfrak{S}_{2r}$ with respect to the Frobenius norm satisfies

$$N(\mathfrak{S}_{2r}, \|\cdot\|_F, \varepsilon) \le (\beta/\varepsilon)^d \tag{2.4}$$

for some integer $d$ and scalar $\beta$ that does not depend on $\varepsilon$. Then if $m \ge d + 1$, with probability one
$\mathrm{null}(\mathcal{G}) \cap \mathfrak{S}_{2r} = \emptyset$, which implies that $\mathrm{null}(\mathcal{G}) \cap \mathfrak{T}_{2r} = \{0\}$.

It just remains to find the covering number of $\mathfrak{S}_{2r}$. We use the following lemma, which uses the
triangle inequality to control the effect of perturbations in the factors of the Tucker decomposition

$$[[\mathcal{C}; U_1, U_2, \ldots, U_K]] := \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 \cdots \times_K U_K, \tag{2.5}$$

where the mode-$i$ (matrix) product of tensor $\mathcal{A}$ with matrix $B$ of compatible size, denoted as $\mathcal{A} \times_i B$,
outputs a tensor $\mathcal{C}$ such that $\mathcal{C}_{(i)} = B \mathcal{A}_{(i)}$.
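The mode-$i$ product and the Tucker decomposition (2.5) can be sketched in NumPy as follows. The function names `mode_product` and `tucker` and the 0-based mode index are assumptions of this example; the check at the end confirms the defining identity $\mathcal{C}_{(i)} = B \mathcal{A}_{(i)}$ on random data.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` unfolding (0-based), remaining modes flattened column-major."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order="F")

def mode_product(A, B, mode):
    """Mode-`mode` product of tensor A with matrix B (B has A.shape[mode] columns)."""
    out = np.tensordot(B, A, axes=(1, mode))  # B's row axis comes out first
    return np.moveaxis(out, 0, mode)          # move it back to position `mode`

def tucker(core, factors):
    """[[core; U_1, ..., U_K]] = core x_1 U_1 x_2 U_2 ... x_K U_K."""
    X = core
    for mode, U in enumerate(factors):
        X = mode_product(X, U, mode)
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((6, 4))
C = mode_product(A, B, 1)
print(C.shape, np.allclose(unfold(C, 1), B @ unfold(A, 1)))
```

Because each mode product touches a different axis, the order in which the factors are applied in `tucker` does not affect the result.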

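Lemma 2. Let $\mathcal{C}, \mathcal{C}' \in \mathbb{R}^{r_1 \times \cdots \times r_K}$, and $U_1, U_1' \in \mathbb{R}^{n_1 \times r_1}, \ldots, U_K, U_K' \in \mathbb{R}^{n_K \times r_K}$ with
$U_i^* U_i = U_i'^* U_i' = I$, and $\|\mathcal{C}\|_F = \|\mathcal{C}'\|_F = 1$. Then

$$\big\| [[\mathcal{C}; U_1, \ldots, U_K]] - [[\mathcal{C}'; U_1', \ldots, U_K']] \big\|_F \le \|\mathcal{C} - \mathcal{C}'\|_F + \sum_{i=1}^{K} \|U_i - U_i'\|. \tag{2.6}$$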

Using this result, we construct an $\varepsilon$-net for $\mathfrak{S}_{2r}$ by building $\varepsilon/(K+1)$-nets for each of the $K + 1$
factors $\mathcal{C}$ and $\{U_i\}$. The total size of the resulting $\varepsilon$-net is thus bounded by the following lemma:

Lemma 3. $N(\mathfrak{S}_{2r}, \|\cdot\|_F, \varepsilon) \le (3(K+1)/\varepsilon)^{(2r)^K + 2nrK}$.

With these observations in hand, Theorem 1 follows immediately.
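For orientation, the exponent in Lemma 3 can be read off by counting the free entries of the $K+1$ factors. The display below is a heuristic parameter count under the standard assumption that a unit ball in $\mathbb{R}^p$ admits an $\varepsilon'$-net of size at most $(3/\varepsilon')^p$; it is a sketch of where the numbers come from, not a substitute for the proof.

```latex
\begin{align*}
\text{core } \mathcal{C} \in \mathbb{R}^{2r \times \cdots \times 2r}
  &:\quad (2r)^K \text{ entries}
  &&\Rightarrow\ \text{net of size} \le \Big(\tfrac{3(K+1)}{\varepsilon}\Big)^{(2r)^K}, \\
\text{each } U_i \in \mathbb{R}^{n \times 2r}
  &:\quad 2nr \text{ entries}
  &&\Rightarrow\ \text{net of size} \le \Big(\tfrac{3(K+1)}{\varepsilon}\Big)^{2nr}, \\
\text{product over } \mathcal{C}, U_1, \ldots, U_K
  &:\quad d = (2r)^K + K \cdot 2nr = (2r)^K + 2nrK,
  &&\text{so Lemma 1 requires } m \ge d + 1 .
\end{align*}
```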
3  Convexification:  Sum  of  Nuclear  Norms?

Since the nonconvex problem (2.1) is NP-hard for general $\mathcal{G}$, it is tempting to seek a convex surrogate.
In matrix recovery problems, the nuclear norm is often an excellent convex surrogate for the rank
[Faz02, RFP10, Gro11]. It seems natural, then, to replace the ranks in (2.1) with nuclear norms,
and solve

$$\text{minimize}_{(\text{w.r.t. } \mathbb{R}^K_+)} \ \ h(\mathcal{X}) := \big( \|\mathcal{X}_{(1)}\|_*,\ \|\mathcal{X}_{(2)}\|_*,\ \ldots,\ \|\mathcal{X}_{(K)}\|_* \big) \quad \text{subject to} \quad \mathcal{G}[\mathcal{X}] = \mathcal{G}[\mathcal{X}_0]. \tag{3.1}$$


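Since each $\|\mathcal{X}_{(i)}\|_*$ is a convex function, the set $\mathcal{H} := \bigcup_{\mathcal{X} : \mathcal{G}[\mathcal{X}] = \mathcal{G}[\mathcal{X}_0]} \{ t \mid t \succeq h(\mathcal{X}) \}$ is convex.
For any Pareto optimal point $\mathcal{X}$, there is a hyperplane supporting $\mathcal{H}$ passing through $h(\mathcal{X})$, with
normal vector $\lambda \ge 0$. Therefore, $\mathcal{X}$ is an optimal solution to the following scalar optimization: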


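$$\text{minimize} \ \ \sum_{i=1}^{K} \lambda_i \|\mathcal{X}_{(i)}\|_* \quad \text{subject to} \quad \mathcal{G}[\mathcal{X}] = \mathcal{G}[\mathcal{X}_0]. \tag{3.2}$$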


The optimization (3.2) was first introduced by [LMWY09] and has been used successfully in
applications in imaging [SVdPDMS11], [SHKM13], [KS13], [LL10], [GEK13], [LYZY10]. Similar convex
relaxations have been considered in a number of theoretical and algorithmic works [GRY11], [SDS10],
[THK10], [TSHK11], [STDLS13]. It is not too surprising, then, that (3.2) provably recovers the
underlying tensor $\mathcal{X}_0$, when the number of measurements $m$ is sufficiently large. For example, the
following is a (simplified) corollary of results of Tomioka et al. [THK10]:¹

Corollary 2 (of [THK10], Theorem 3). Suppose that $\mathcal{X}_0$ has Tucker rank $(r, \ldots, r)$, and
$m \ge Crn^{K-1}$. With high probability, $\mathcal{X}_0$ is an optimal solution to (3.2), with each $\lambda_i = 1$. Here, $C$ is
numerical.


This result shows that there is a range in which (3.2) succeeds: loosely, when we undersample
by at most a factor of $m/N \sim r/n$. However, the number of observations $m \sim rn^{K-1}$ is significantly
larger than the number of degrees of freedom in $\mathcal{X}_0$, which is on the order of $r^K + nrK$. Is it
possible to prove a better bound for this model? Unfortunately, we show that in general $\Omega(rn^{K-1})$
measurements are also necessary for reliable recovery using (3.2):


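Theorem 3. Let $\mathcal{X}_0 \in \mathfrak{T}_r$ be nonzero. Set $\kappa = \min_i \{ \|(\mathcal{X}_0)_{(i)}\|_*^2 / \|\mathcal{X}_0\|_F^2 \} \times n^{K-1}$. Then if the
number of measurements $m \le \kappa - 2$, $\mathcal{X}_0$ is not the unique solution to (3.2), with probability at least
$1 - 4\exp\big( -\frac{(\kappa - m - 2)^2}{16(\kappa - 2)} \big)$. Moreover, there exists $\mathcal{X}_0 \in \mathfrak{T}_r$ for which $\kappa = rn^{K-1}$.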

This implies that Corollary 2 (and other results of [THK10]) is essentially tight. Unfortunately,
it has negative implications for the efficacy of the sum of nuclear norms in (3.2): although a generic
element $\mathcal{X}_0$ of $\mathfrak{T}_r$ can be described using at most $r^K + nrK$ real numbers, we require $\Omega(rn^{K-1})$
observations to recover it using (3.2). Theorem 3 is a direct consequence of a much more general
principle underlying multi-structured recovery, which is elaborated next.


¹ Tomioka et al. also show noise stability when $m = \Omega(rn^{K-1})$ and give extensions to the case where
$\mathrm{rank}_{tc}(\mathcal{X}_0) = (r_1, \ldots, r_K)$ differs from mode to mode.
Recovering objects with multiple structures

The poor behavior of (3.2) is actually an instance of a much more general phenomenon, first
discovered by Oymak et al. [OJF+12]. Our target tensor $\mathcal{X}_0$ has multiple low-dimensional structures
simultaneously: it is low-rank along each of the $K$ modes. In practical applications, many other
such simultaneously structured objects may be of interest - for example, matrices that are
simultaneously sparse and low-rank [RSV12], [OJF+12]. To recover such a simultaneously structured object,
it is tempting to build a convex relaxation by combining the convex relaxations for each of the
individual structures. In the tensor case, this yields (3.2). Surprisingly, this combination is often
not significantly more powerful than the best single regularizer [OJF+12]. We obtain Theorem 3
as a consequence of a new, general result of this nature, using a geometric framework introduced
in [ALMT13]. Compared to the proof strategy in [OJF+12], this approach has a clearer geometric
intuition, covers a more general class of regularizers and yields sharper bounds.

Consider a signal $x_0 \in \mathbb{R}^n$ having $K$ low-dimensional structures simultaneously (e.g. sparsity,
low-rank, etc.).² Let $\|\cdot\|_{(i)}$ be the penalty norm corresponding to the $i$-th structure (e.g. $\ell_1$, nuclear
norm). Consider the composite norm optimization

$$\min_x \ f(x) := \lambda_1 \|x\|_{(1)} + \lambda_2 \|x\|_{(2)} + \cdots + \lambda_K \|x\|_{(K)} \quad \text{subject to} \quad \mathcal{G}[x] = \mathcal{G}[x_0], \tag{3.3}$$

where $\mathcal{G}[\cdot]$ is a Gaussian measurement operator, and $\lambda > 0$. Is $x_0$ the unique optimal solution to
(3.3)? Recall that the descent cone of a function $f$ at a point $x_0$ is defined as

$$\mathcal{C}(f, x_0) = \mathrm{cone}\{ v \mid f(x_0 + v) \le f(x_0) \}, \tag{3.4}$$


which, in short, will be denoted as $\mathcal{C}$. Then $x_0$ is the unique optimal solution if and only if
$\mathrm{null}(\mathcal{G}) \cap \mathcal{C} = \{0\}$. Conversely, recovery fails if $\mathrm{null}(\mathcal{G})$ has nontrivial intersection with $\mathcal{C}$. If $\mathcal{G}$ is a Gaussian
operator, $\mathrm{null}(\mathcal{G})$ is a uniformly oriented random subspace of dimension $n - m$. This random subspace
is more likely to have nontrivial intersection with $\mathcal{C}$ if $\mathcal{C}$ is "large," in a sense we will make precise.
The polar of $\mathcal{C}$ is $\mathcal{C}^\circ = \mathrm{cone}(\partial f(x_0))$. Because polarity reverses inclusion, we expect that $\mathcal{C}$ will be
"large" whenever $\mathcal{C}^\circ$ is "small". Figure 1 visualizes this geometry.

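To control the size of $\mathcal{C}^\circ$, first consider a single norm $\|\cdot\|_\diamond$, with dual norm $\|\cdot\|_\diamond^*$. Suppose that
$\|\cdot\|_\diamond$ is $L$-Lipschitz: $\|x\|_\diamond \le L \|x\|_2$ for all $x$. Then $\|x\|_2 \le L \|x\|_\diamond^*$ for all $x$ as well. Noting that

$$\partial \|\cdot\|_\diamond (x) = \{ v \mid \langle v, x \rangle = \|x\|_\diamond,\ \|v\|_\diamond^* \le 1 \},$$

for any $v \in \partial \|\cdot\|_\diamond (x_0)$, we have

$$\frac{\langle v, x_0 \rangle}{\|v\|_2 \|x_0\|_2} \ \ge\ \frac{\|x_0\|_\diamond}{L \|v\|_\diamond^* \|x_0\|_2} \ \ge\ \frac{\|x_0\|_\diamond}{L \|x_0\|_2}. \tag{3.5}$$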

A more geometric way of summarizing this is as follows: for $x \neq 0$, let

$$\mathrm{circ}(x, \theta) = \{ z \mid \angle(z, x) \le \theta \} \tag{3.6}$$

denote the circular cone with axis $x$ and angle $\theta$. Then if $x_0 \neq 0$, and
$\theta = \cos^{-1}\big( \|x_0\|_\diamond / (L \|x_0\|_2) \big)$,

$$\partial \|\cdot\|_\diamond (x_0) \subseteq \mathrm{circ}(x_0, \theta). \tag{3.7}$$

² $x_0$ is the underlying signal of our interest (perhaps after vectorization).
Figure 1: Suppose our $x_0$ has two sparse structures simultaneously. Regularizer $\|\cdot\|_{(1)}$ has a larger conic hull of
subdifferential at $x_0$, i.e. $\mathrm{cone}(\partial \|x_0\|_{(1)})$, which results in a smaller descent cone. Thus minimizing
$\|\cdot\|_{(1)}$ is more likely to recover $x_0$ than minimizing $\|\cdot\|_{(2)}$. Consider convex regularizer
$f(x) = \|x\|_{(1)} + \|x\|_{(2)}$. Suppose as depicted, $\theta_1 \ge \theta_2$. Then both $\mathrm{cone}(\partial \|x_0\|_{(1)})$ and $\mathrm{cone}(\partial \|x_0\|_{(2)})$
are in the circular cone $\mathrm{circ}(x_0, \theta_1)$. Thus we have: $\mathrm{cone}(\partial f(x_0)) = \mathrm{cone}(\partial \|x_0\|_{(1)} + \partial \|x_0\|_{(2)}) \subseteq
\mathrm{conv}\{\mathrm{circ}(x_0, \theta_1), \mathrm{circ}(x_0, \theta_2)\} = \mathrm{circ}(x_0, \theta_1)$.


Table 1 describes the angle parameters $\theta$ for various structure inducing norms. Notice that in general,
more complicated $x_0$ leads to smaller angles $\theta$. For example, if $x_0$ is a $k$-sparse vector with entries
all of the same magnitude, and $\|\cdot\|_\diamond$ the $\ell_1$ norm, $\cos^2 \theta = k/n$. As $x_0$ becomes more dense, $\partial \|\cdot\|_\diamond$
is contained in smaller and smaller circular cones.
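The $k$-sparse example can be checked numerically from the formula $\cos\theta = \|x_0\|_\diamond / (L\|x_0\|_2)$, using the fact that the $\ell_1$ norm on $\mathbb{R}^n$ is $\sqrt{n}$-Lipschitz with respect to the $\ell_2$ norm. The function name `cos2_theta` and the particular values of $n$, $k$ and the common magnitude are assumptions for this sketch.

```python
import numpy as np

def cos2_theta(x0, norm_value, lipschitz):
    """Squared cosine of the cone angle: (||x0||_norm / (L * ||x0||_2))^2."""
    return (norm_value / (lipschitz * np.linalg.norm(x0))) ** 2

n, k = 100, 7                 # hypothetical ambient dimension and sparsity
x0 = np.zeros(n)
x0[:k] = 3.0                  # k nonzeros, all of the same magnitude
c2 = cos2_theta(x0, norm_value=np.abs(x0).sum(), lipschitz=np.sqrt(n))
print(c2, k / n)
```

Both printed numbers agree, and the value does not depend on the common magnitude chosen for the nonzero entries.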

For $f = \sum_i \lambda_i \|\cdot\|_{(i)}$, notice that every element of $\partial f(x_0)$ is a conic combination of elements of
the $\partial \|\cdot\|_{(i)}(x_0)$. Since each of the $\partial \|\cdot\|_{(i)}(x_0)$ is contained in a circular cone with axis $x_0$, $\partial f(x_0)$
is also contained in a circular cone:

Lemma 4. Suppose that $\|\cdot\|_{(i)}$ is $L_i$-Lipschitz. For $x_0 \neq 0$, set $\theta_i = \cos^{-1}\big( \|x_0\|_{(i)} / (L_i \|x_0\|_2) \big)$.
Then

$$\partial f(x_0) \subseteq \mathrm{circ}\big( x_0, \max_i \theta_i \big). \tag{3.8}$$

So, the subdifferential of our combined regularizer $f$ is contained in a circular cone whose angle is
given by the largest of the $\theta_i$.

How does this behavior affect the recoverability of $x_0$ via (3.3)? The informal reasoning above
suggests that as $\theta$ becomes smaller, the descent cone $\mathcal{C}$ becomes larger, and we require more
measurements to recover $x_0$. This can be made precise using an elegant framework introduced by Amelunxen
et al. [ALMT13]. They define the statistical dimension of the convex cone $\mathcal{C}$ to be the expected
squared norm of the projection of a standard Gaussian vector onto $\mathcal{C}$:

$$\delta(\mathcal{C}) = \mathbb{E}_{g \sim_{\text{i.i.d.}} \mathcal{N}(0,1)} \big[ \|P_{\mathcal{C}}(g)\|_2^2 \big]. \tag{3.9}$$
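Definition (3.9) lends itself to a simple Monte Carlo estimate whenever the projection onto the cone is easy to compute. The sketch below uses the nonnegative orthant in $\mathbb{R}^n$, chosen here only as a convenient test case: its projection clips negative entries to zero and its statistical dimension is $n/2$. The function name, the number of trials and the seed are assumptions.

```python
import numpy as np

def statistical_dimension_mc(project, n, trials=20000, seed=0):
    """Monte Carlo estimate of E ||P_C(g)||_2^2 for standard Gaussian g in R^n.
    `project` maps an array of row vectors to their projections onto the cone."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((trials, n))
    return float(np.mean(np.sum(project(g) ** 2, axis=1)))

def project_orthant(g):
    """Euclidean projection onto the nonnegative orthant."""
    return np.maximum(g, 0.0)

est = statistical_dimension_mc(project_orthant, n=50)
print(est)  # should be close to n / 2 = 25
```

The descent cones studied in this section do not have such a simple projection, which is why the text bounds their statistical dimension analytically instead.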
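| Object | Complexity measure | Relaxation | $\cos^2 \theta$ | $\kappa$ |
|---|---|---|---|---|
| Sparse $x \in \mathbb{R}^n$ | $k = \lVert x \rVert_0$ | $\lVert x \rVert_1$ | $[\frac{1}{n}, \frac{k}{n}]$ | $[1, k]$ |
| Column-sparse $X \in \mathbb{R}^{n_1 \times n_2}$ | $c = \#\{ j : X e_j \neq 0 \}$ | $\sum_j \lVert X e_j \rVert_2$ | $[\frac{1}{n_2}, \frac{c}{n_2}]$ | $[n_1, c n_1]$ |
| Low-rank $X \in \mathbb{R}^{n_1 \times n_2}$ ($n_1 \ge n_2$) | $r = \mathrm{rank}(X)$ | $\lVert X \rVert_*$ | $[\frac{1}{n_2}, \frac{r}{n_2}]$ | $[n_1, r n_1]$ |
| Low-rank $\mathcal{X} \in \mathbb{R}^{n \times n \times \cdots \times n}$ | $\mathrm{rank}_{tc}(\mathcal{X})$ | $\sum_i \lVert \mathcal{X}_{(i)} \rVert_*$ | $[\frac{1}{n}, \frac{r}{n}]$ | $[n^{K-1}, r n^{K-1}]$ |
| Low-rank $\mathcal{X} \in \mathbb{R}^{n \times n \times \cdots \times n}$ | $\mathrm{rank}_{tc}(\mathcal{X})$ | $\lVert \mathcal{X} \rVert_\square$ | $[(\frac{1}{n})^{\lfloor K/2 \rfloor}, (\frac{r}{n})^{\lfloor K/2 \rfloor}]$ | $[n^{\lceil K/2 \rceil}, r^{\lfloor K/2 \rfloor} n^{\lceil K/2 \rceil}]$ |

Table 1: Concise models and their surrogates. For each norm $\|\cdot\|_\diamond$, the fourth column describes
the range of achievable angles $\theta$. Larger $\cos \theta$ corresponds to a smaller $\mathcal{C}^\circ$, a larger $\mathcal{C}$, and hence a
larger number of measurements required for reliable recovery. The fourth line is the sum of nuclear
norms; the last line is the square norm introduced in Section 4.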


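Using tools from spherical integral geometry, [ALMT13] shows that for linear inverse problems with
Gaussian measurements, a sharp phase transition in recoverability occurs around $m = \delta(\mathcal{C})$. We will
need only one side of their result; for more details see [ALMT13]. We state a slight variant here:

Corollary 4. Let $\mathcal{G} : \mathbb{R}^n \to \mathbb{R}^m$ be a Gaussian operator, and $\mathcal{C}$ a convex cone. Then if $m \le \delta(\mathcal{C})$,

$$\mathbb{P}\big[ \mathcal{C} \cap \mathrm{null}(\mathcal{G}) = \{0\} \big] \le 4 \exp\left( -\frac{(\delta(\mathcal{C}) - m)^2}{16\, \delta(\mathcal{C})} \right). \tag{3.10}$$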

To apply this result to our problem, we lower bound the statistical dimension $\delta(\mathcal{C})$ of the descent
cone $\mathcal{C}$ of $f$ at $x_0$. Using the Pythagorean theorem, monotonicity of $\delta(\cdot)$, and Lemma 4, we calculate

$$\delta(\mathcal{C}) = n - \delta(\mathcal{C}^\circ) = n - \delta\big( \mathrm{cone}(\partial f(x_0)) \big) \ge n - \delta\big( \mathrm{circ}(x_0, \max_i \theta_i) \big). \tag{3.11}$$

Moreover, using the properties of statistical dimension, we are able to prove an upper bound for
the statistical dimension of a circular cone, which improves the constant in existing results [ALMT13],
[McC13].
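Lemma 5. $\delta(\mathrm{circ}(x_0, \theta)) \le n \sin^2 \theta + 2$.

Finally, by combining (3.11) and Lemma 5, we have $\delta(\mathcal{C}) \ge n \min_i \cos^2 \theta_i - 2$. Using Corollary
4, we obtain:

Theorem 5. Let $x_0 \neq 0$. Suppose that for each $i$, $\|\cdot\|_{(i)}$ is $L_i$-Lipschitz. Set

$$\kappa_i = \frac{n \|x_0\|_{(i)}^2}{L_i^2 \|x_0\|_2^2} = n \cos^2(\theta_i),$$

and $\kappa = \min_i \kappa_i$. Then if $m \le \kappa - 2$,

$$\mathbb{P}\big[ x_0 \text{ is the unique optimal solution to (3.3)} \big] \le 4 \exp\left( -\frac{(\kappa - m - 2)^2}{16(\kappa - 2)} \right). \tag{3.12}$$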
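The right-hand side of (3.12) is easy to evaluate directly. The helper below is a small sketch; the function name and the hypothetical value $\kappa = 1002$ are assumptions. Choosing $m = (\kappa - 2)/2$ gives the bound $4\exp(-(\kappa - 2)/64)$.

```python
import math

def success_probability_upper_bound(kappa, m):
    """Right-hand side of (3.12), capped at 1; valid for 2 < kappa and m <= kappa - 2."""
    if kappa <= 2 or m > kappa - 2:
        raise ValueError("the bound requires kappa > 2 and m <= kappa - 2")
    exponent = -((kappa - m - 2) ** 2) / (16.0 * (kappa - 2))
    return min(1.0, 4.0 * math.exp(exponent))

kappa = 1002                      # hypothetical value of kappa
m = (kappa - 2) // 2              # half of kappa - 2, i.e. 500 measurements
bound = success_probability_upper_bound(kappa, m)
print(bound, 4.0 * math.exp(-(kappa - 2) / 64.0))
```

The cap at one matters only when $m$ is close to $\kappa - 2$, where the raw expression exceeds one and the bound carries no information.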


Thus, for reliable recovery, the number of measurements needs to be at least proportional to $\kappa$.³
Notice that $\kappa = \min_i \kappa_i$ is determined by only the best of the structures. Per Table 1, $\kappa_i$ is often on
the order of the number of degrees of freedom in a generic object of the $i$-th structure. For example,
for a $k$-sparse vector whose nonzeros are all of the same magnitude, $\kappa = k$.

Theorem 5 together with Table 1 leads us to the phenomenon that was recently discovered by Oymak
et al. [OJF+12]: for recovering objects with multiple structures, a combination of structure-inducing
norms tends to be not significantly more powerful than the best individual structure-inducing norm.
As we demonstrate, this general behavior has a clear geometric interpretation: the
subdifferential of a norm at $x_0$ is contained in a relatively small circular cone with central axis $x_0$.

We can specialize Theorem 5 to low-rank tensors as follows: if $\mathcal{X}$ is a $K$-mode $n \times n \times \cdots \times n$
tensor of Tucker rank $(r, r, \ldots, r)$, then for each $i$, $\|\mathcal{X}\|_{(i)} := \|\mathcal{X}_{(i)}\|_*$ is $L = \sqrt{n}$-Lipschitz. Hence,

$$\kappa = \min_i \big\{ \|\mathcal{X}_{(i)}\|_*^2 / \|\mathcal{X}\|_F^2 \big\}\, n^{K-1}. \tag{3.13}$$

The term in brackets lies between $1$ and $r$, inclusive. For example, if $\mathcal{X} = [[\mathcal{C}; U_1, \ldots, U_K]]$ with
$U_i^* U_i = I$ and $\mathcal{C}$ supersymmetric ($\mathcal{C}_{i_1 \ldots i_K} = \mathbb{1}_{i_1 = i_2 = \cdots = i_K}$), then this term is equal to $r$.
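The extremal example can be verified numerically. The sketch below builds a three-way tensor from a core with ones on its superdiagonal and orthonormal factors, then evaluates the bracketed ratio in (3.13) for every mode. The sizes, the seed, and the use of a QR factorization to obtain orthonormal factors are assumptions of this example.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` unfolding (0-based), remaining modes flattened column-major."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order="F")

rng = np.random.default_rng(1)
n, r = 6, 4                                  # hypothetical sizes, K = 3 modes
core = np.zeros((r, r, r))
idx = np.arange(r)
core[idx, idx, idx] = 1.0                    # ones on the superdiagonal only
# Orthonormal factors: Q from the reduced QR of a random n x r matrix.
U = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(3)]
X = np.einsum("abc,ia,jb,kc->ijk", core, *U)

fro2 = np.sum(X ** 2)
ratios = [np.linalg.norm(unfold(X, i), ord="nuc") ** 2 / fro2 for i in range(3)]
kappa = min(ratios) * n ** 2                 # (3.13) with K = 3, so n^(K-1) = n^2
print(ratios, kappa)
```

Each unfolding here has $r$ singular values equal to one, so the ratio is $r^2 / r = r$ and $\kappa$ attains its largest possible value $r n^{K-1}$.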

4  A  Better  Convexification:  Square  Norm 

The number of measurements promised by Corollary 2 and Theorem 3 is actually the same (up to
constants) as the number of measurements required to recover a tensor $\mathcal{X}_0$ which is low-rank along
just one mode. Since matrix nuclear norm minimization correctly recovers an $n_1 \times n_2$ matrix of rank
$r$ when $m \ge Cr(n_1 + n_2)$ [CRPW12], solving

$$\text{minimize} \ \ \|\mathcal{X}_{(1)}\|_* \quad \text{subject to} \quad \mathcal{G}[\mathcal{X}] = \mathcal{G}[\mathcal{X}_0] \tag{4.1}$$

also exactly recovers $\mathcal{X}_0$ with high probability when $m \ge Crn^{K-1}$.

This suggests a more mundane explanation for the difficulty with (3.2): the term $rn^{K-1}$ comes
from the need to reconstruct the right singular vectors of the $n \times n^{K-1}$ matrix $\mathcal{X}_{(1)}$. If we had
some way of matricizing a tensor that produced a more balanced (square) matrix and also preserved
the low-rank property, we could substantially reduce this effect, and reduce the overall sampling
requirement. In fact, this is possible when the order $K$ of $\mathcal{X}_0$ is four or larger.

For $A \in \mathbb{R}^{m_1 \times n_1}$, and integers $m_2$ and $n_2$ satisfying $m_1 n_1 = m_2 n_2$, the reshaping operator
$\mathrm{reshape}(A, m_2, n_2)$ returns an $m_2 \times n_2$ matrix whose elements are taken columnwise from $A$. This
operator rearranges elements in $A$ and leads to a matrix of different shape. In the following,
we reshape matrix $\mathcal{X}_{(1)}$ to a more square matrix while preserving the low-rank property. Let
$\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_K}$. Select $j \in [K] := \{1, 2, \ldots, K\}$. Then we define matrix $\mathcal{X}_{[j]}$ as

$$\mathcal{X}_{[j]} = \mathrm{reshape}\Big( \mathcal{X}_{(1)},\ \prod_{i=1}^{j} n_i,\ \prod_{i=j+1}^{K} n_i \Big).$$

We can view $\mathcal{X}_{[j]}$ as a natural generalization of the standard tensor matricization. When $j = 1$,
$\mathcal{X}_{[j]}$ is nothing but $\mathcal{X}_{(1)}$. However, when some $j > 1$ is selected, $\mathcal{X}_{[j]}$ becomes a more balanced
matrix. This reshaping also preserves some of the algebraic structures of $\mathcal{X}$. In particular, we will
see that if $\mathcal{X}$ is a low-rank tensor (in either the CP or Tucker sense), $\mathcal{X}_{[j]}$ will be a low-rank matrix.
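In NumPy, the reshaping described above amounts to a single column-major reshape of the tensor, since taking the elements of $\mathcal{X}_{(1)}$ columnwise visits the entries with the first index varying fastest. The sketch below is one possible realization; the function name `square_unfold` and the test tensor are assumptions.

```python
import numpy as np

def square_unfold(X, j):
    """Group the first j modes as rows and the remaining modes as columns.
    Column-major ('F') order matches taking elements columnwise from the mode-1 unfolding."""
    if not 1 <= j <= X.ndim:
        raise ValueError("j must satisfy 1 <= j <= number of modes")
    rows = int(np.prod(X.shape[:j]))
    return np.reshape(X, (rows, -1), order="F")

rng = np.random.default_rng(0)
# Hypothetical 4-way tensor with side length 5 and Tucker rank (2, 2, 2, 2).
core = rng.standard_normal((2, 2, 2, 2))
U = [rng.standard_normal((5, 2)) for _ in range(4)]
X = np.einsum("abcd,ia,jb,kc,ld->ijkl", core, *U)

M1 = square_unfold(X, 1)   # the usual 5 x 125 mode-1 unfolding
M2 = square_unfold(X, 2)   # a balanced 25 x 25 matrix
print(M1.shape, np.linalg.matrix_rank(M1), M2.shape, np.linalg.matrix_rank(M2))
```

The balanced matrix has a larger rank than the mode-1 unfolding in this example, but its rank is still far below its side length.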

³ E.g., if $m = (\kappa - 2)/2$, the probability of success is at most $4\exp(-(\kappa - 2)/64)$.
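Lemma 6. (1) If $\mathcal{X}$ has CP decomposition $\mathcal{X} = \sum_{i=1}^{r} a_i^{(1)} \circ a_i^{(2)} \circ \cdots \circ a_i^{(K)}$, then

$$\mathcal{X}_{[j]} = \sum_{i=1}^{r} \big( a_i^{(j)} \otimes a_i^{(j-1)} \otimes \cdots \otimes a_i^{(1)} \big) \circ \big( a_i^{(K)} \otimes \cdots \otimes a_i^{(j+1)} \big). \tag{4.2}$$

(2) If $\mathcal{X}$ has Tucker decomposition $\mathcal{X} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 \cdots \times_K U_K$, then

$$\mathcal{X}_{[j]} = (U_j \otimes U_{j-1} \otimes \cdots \otimes U_1)\, \mathcal{C}_{[j]}\, (U_K \otimes U_{K-1} \otimes \cdots \otimes U_{j+1})^*. \tag{4.3}$$

Using Lemma 6 and the fact that $\mathrm{rank}(A \otimes B) = \mathrm{rank}(A)\,\mathrm{rank}(B)$, we obtain:

Lemma 7. Let $\mathrm{rank}_{tc}(\mathcal{X}) = (r_1, r_2, \ldots, r_K)$, and $\mathrm{rank}_{cp}(\mathcal{X}) = r_{cp}$. Then $\mathrm{rank}(\mathcal{X}_{[j]}) \le r_{cp}$, and
$\mathrm{rank}(\mathcal{X}_{[j]}) \le \min\big\{ \prod_{i=1}^{j} r_i,\ \prod_{i=j+1}^{K} r_i \big\}$.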
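The Kronecker identity for the Tucker case can be checked numerically on random data. The sketch below assumes real-valued factors (so the adjoint is a transpose), a column-major reshape for the bracketed unfolding, and hypothetical dimensions; it then confirms the rank bound of Lemma 7 for this instance.

```python
import numpy as np
from functools import reduce

def square_unfold(X, j):
    """First j modes as rows, remaining modes as columns, in column-major order."""
    return np.reshape(X, (int(np.prod(X.shape[:j])), -1), order="F")

rng = np.random.default_rng(2)
dims, ranks, j = (3, 4, 2, 5), (2, 3, 2, 2), 2    # hypothetical sizes
core = rng.standard_normal(ranks)
U = [rng.standard_normal((n, r)) for n, r in zip(dims, ranks)]
X = np.einsum("abcd,ia,jb,kc,ld->ijkl", core, *U)

left = reduce(np.kron, U[:j][::-1])     # U_j (x) ... (x) U_1
right = reduce(np.kron, U[j:][::-1])    # U_K (x) ... (x) U_{j+1}
rhs = left @ square_unfold(core, j) @ right.T

lhs = square_unfold(X, j)
bound = min(int(np.prod(ranks[:j])), int(np.prod(ranks[j:])))
print(np.allclose(lhs, rhs), np.linalg.matrix_rank(lhs), bound)
```

The reversed ordering of the Kronecker factors mirrors the column-major layout, in which the first index of each group varies fastest.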

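Thus, $\mathcal{X}_{[j]}$ is not only more balanced but also maintains the low-rank property of tensor $\mathcal{X}$. In
the following, we show how this new matricization can lead to better relaxations for tensor recovery.
For ease of discussion, we assume $\mathcal{X}$ has the same length (say $n$) along each mode and has Tucker
rank $(r, r, \ldots, r)$. We write $\mathcal{X}_\square = \mathcal{X}_{[\lceil K/2 \rceil]}$ and call $\|\mathcal{X}\|_\square := \|\mathcal{X}_\square\|_*$ the square norm of tensor $\mathcal{X}$.
Since $\mathcal{X}_\square$ is low-rank, we can attempt to recover $\mathcal{X}$ by solving

$$\text{minimize} \ \ \|\mathcal{X}\|_\square \quad \text{subject to} \quad \mathcal{G}[\mathcal{X}] = \mathcal{G}[\mathcal{X}_0]. \tag{4.4}$$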


Using Lemma 7 and Proposition 3.11 of [CRPW12], we can prove that this relaxation exactly recovers
$\mathcal{X}_0$, when the number of measurements is sufficiently large:


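Theorem 6. (1) If $\mathcal{X}_0$ has CP rank $r$, using (4.4), $m \ge Crn^{\lceil K/2 \rceil}$ is sufficient to recover $\mathcal{X}_0$ with
high probability. (2) If $\mathcal{X}_0$ has Tucker rank $(r, r, \ldots, r)$, using (4.4), $m \ge Cr^{\lfloor K/2 \rfloor} n^{\lceil K/2 \rceil}$ is sufficient
to recover $\mathcal{X}_0$ with high probability.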

Compared with the $\Omega(rn^{K-1})$ measurements required by the sum-of-nuclear-norms model, the sample complexity $O(r^{\lfloor K/2 \rfloor} n^{\lceil K/2 \rceil})$ required by the square reshaping (4.4) is always within a constant of it, and is much better for small $r$ and $K \ge 4$, e.g., by a multiplicative factor of $n^{\lfloor K/2 \rfloor - 1}$ when $r$ is a constant. This is a significant improvement. However, there are also two clear limitations. First, no improvement is obtained for the case $K = 3$. Second, the improved sample complexity in Theorem 6 is still suboptimal compared to the nonconvex model (2.1).
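To make the comparison concrete, the short sketch below evaluates the two orders of growth with all constants dropped. The function names are hypothetical and the numbers are only orders of magnitude, not actual measurement counts.

```python
def snn_lower_bound(n: int, r: int, K: int) -> int:
    """Order of the SNN lower bound, r * n^(K-1), constant dropped."""
    return r * n ** (K - 1)

def square_upper_bound(n: int, r: int, K: int) -> int:
    """Order of the square-norm bound, r^floor(K/2) * n^ceil(K/2), constant dropped."""
    return r ** (K // 2) * n ** ((K + 1) // 2)

n, r = 20, 1
for K in (3, 4, 5, 6):
    ratio = snn_lower_bound(n, r, K) // square_upper_bound(n, r, K)
    # For constant r the ratio equals n^(floor(K/2) - 1).
    print(K, ratio, n ** (K // 2 - 1))
```

For $K = 3$ the ratio is 1, which matches the first limitation noted above; for $K \ge 4$ the ratio grows polynomially in $n$.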

It is also worth noting that for tensors with different lengths or ranks, Theorem 3 and Theorem 6 can be easily modified. It remains true that for a large class of tensors, our square reshaping is capable of reducing the number of generic measurements required by the SNN model. However, the comparison between the sum-of-nuclear-norms and the square norm then becomes quite subtle. Concrete instances can be constructed for which the square norm model has no advantage over the SNN model even for $K > 3$ (e.g. a tensor of size $1000 \times 10 \times 10 \times 10$ with Tucker rank $(1, 1, 1, 1)$). On the other hand, our square norm model can sometimes benefit from unbalanced tensors. For example, consider a tensor of size $4 \times 9 \times 12 \times 3$ with Tucker rank $(2, 2, 1, 1)$. Then our reshaped matrix is a $36 \times 36$ square matrix with rank 1, which is the most favorable setting for nuclear norm recovery.
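The arithmetic behind these two examples can be reproduced with a few lines. The helper `reshaped_matrix` is a hypothetical name introduced here; it simply returns the shape of the unfolding that groups the first $j$ modes, together with the rank bound of Lemma 7.

```python
from math import prod

def reshaped_matrix(dims, ranks, j):
    """Shape and Lemma 7 rank bound of the unfolding grouping the first j modes."""
    rows, cols = prod(dims[:j]), prod(dims[j:])
    rank_bound = min(prod(ranks[:j]), prod(ranks[j:]))
    return rows, cols, rank_bound

# Unbalanced tensor that reshapes into a perfectly square rank-1 matrix.
print(reshaped_matrix((4, 9, 12, 3), (2, 2, 1, 1), j=2))
# For this tensor, the mode-1 unfolding (j = 1) is already square,
# while grouping two modes (j = 2) makes the matrix more unbalanced.
print(reshaped_matrix((1000, 10, 10, 10), (1, 1, 1, 1), j=1))
print(reshaped_matrix((1000, 10, 10, 10), (1, 1, 1, 1), j=2))
```

In the second example the mode-1 unfolding used by the SNN model is already $1000 \times 1000$, which is one way to see why the square reshaping brings no gain there.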


5  Simulation  Results  for  Tensor  Completion 


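Tensor completion attempts to reconstruct the low-rank tensor $\mathcal{X}_0$ based on observations over a subset of its entries $\Omega$. By imposing appropriate incoherence conditions (and slightly modifying the arguments in [Gro11]), it is possible to prove recovery guarantees for each of the following programs:

$$\text{minimize} \quad \sum_{i=1}^{K} \lambda_i \|\mathcal{X}_{(i)}\|_* \quad \text{subject to} \quad \mathcal{P}_\Omega[\mathcal{X}] = \mathcal{P}_\Omega[\mathcal{X}_0]; \tag{5.1}$$

$$\text{minimize} \quad \|\mathcal{X}\|_\square \quad \text{subject to} \quad \mathcal{P}_\Omega[\mathcal{X}] = \mathcal{P}_\Omega[\mathcal{X}_0]. \tag{5.2}$$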

Unlike the recovery problem under Gaussian random measurements, due to the lack of sharp upper bounds, we have no proof that our square norm formulation outperforms the SNN model here. However, our simulation results below strongly suggest that (5.2) also performs much better than (5.1) for tensor completion.


[Figure 2, two panels: "Tensor completion with Square Norm minimization" and "Tensor completion with SNN minimization". Horizontal axis: fraction of entries observed (rho), with ticks from 0.02 to 0.2. Vertical axis: size of tensor (n).]

Figure 2: Tensor completion. The colormap indicates the fraction of correct recovery, which increases with brightness from certain failure (black) to certain success (white).


Our experiment is set up as follows. We generate a 4-way tensor $\mathcal{X}_0 \in \mathbb{R}^{n \times n \times n \times n}$ as

$$\mathcal{X}_0 = [\![\mathcal{C}_0; U_1, U_2, U_3, U_4]\!] = \mathcal{C}_0 \times_1 U_1 \times_2 U_2 \times_3 U_3 \times_4 U_4,$$

where the core tensor $\mathcal{C}_0 \in \mathbb{R}^{1 \times 1 \times 2 \times 2}$ has i.i.d. standard Gaussian entries, and the matrices $U_1, U_2 \in \mathbb{R}^{n \times 1}$ and $U_3, U_4 \in \mathbb{R}^{n \times 2}$, satisfying $U_i^* U_i = I$, are drawn uniformly at random (by the command orth(randn(·, ·)) in Matlab). The observed entries are chosen uniformly with ratio $\rho$. We increase the problem size $n$ from 10 to 30 with increment 1, and the observation ratio $\rho$ from 0.01 to 0.2 with increment 0.01. For each $(\rho, n)$-pair, we simulate 5 test instances and declare a trial to be successful if the recovered $\mathcal{X}^\star$ satisfies

$$\frac{\|\mathcal{X}^\star - \mathcal{X}_0\|_F}{\|\mathcal{X}_0\|_F} \le 10^{-2}.$$

The optimization problems are solved using efficient first-order methods. Since (5.2) is in the form of standard matrix completion, we use the Augmented Lagrangian Method (ALM) proposed in [LCM10] to solve it. For the sum of nuclear norms minimization (5.1) with $\lambda_i = 1$, we implement the accelerated linearized Bregman algorithm [HMG13], of which we include a detailed discussion in the appendix. Figure 2 plots the fraction of correct recovery for each pair (black = 0% and white = 100%). Clearly, a much larger white region is produced by the square norm, which empirically suggests that (5.2) outperforms (5.1) for the tensor completion problem.
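The data generation and success criterion described above can be sketched in NumPy as follows. This is an illustrative reimplementation, not the authors' Matlab code: the function names are hypothetical, a QR factorization stands in for orth(randn(·, ·)), and we assume that exactly a fraction $\rho$ of the entries (rounded) is observed.

```python
import numpy as np

def random_orthonormal(n, r, rng):
    """n x r matrix with orthonormal columns (stand-in for orth(randn(n, r)))."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return Q

def make_instance(n, rho, rng):
    """Tensor of size n^4 with Tucker rank (1, 1, 2, 2) and a uniform sampling mask."""
    core = rng.standard_normal((1, 1, 2, 2))
    U = [random_orthonormal(n, r, rng) for r in (1, 1, 2, 2)]
    X0 = np.einsum("abcd,ia,jb,kc,ld->ijkl", core, *U)
    # Observe a uniformly random subset containing a fraction rho of the entries.
    m = int(round(rho * X0.size))
    mask = np.zeros(X0.size, dtype=bool)
    mask[rng.choice(X0.size, size=m, replace=False)] = True
    return X0, mask.reshape(X0.shape), U

def is_success(X_hat, X0, tol=1e-2):
    """Relative Frobenius error criterion used to declare a successful trial."""
    return np.linalg.norm(X_hat - X0) / np.linalg.norm(X0) <= tol

rng = np.random.default_rng(0)
X0, mask, U = make_instance(n=10, rho=0.1, rng=rng)
print(X0.shape, int(mask.sum()))
```

A solver for (5.1) or (5.2) would then be run on the observed entries `X0[mask]`, and its output passed to `is_success`.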


6  Conclusion 


In this paper, we establish several theoretical bounds for the problem of low-rank tensor recovery using random Gaussian measurements. For the nonconvex model (2.1), we show that $(2r)^K + 2nrK + 1$ measurements are sufficient to recover any $\mathcal{X}_0 \in \mathfrak{T}_r$ almost surely. We highlight that though the nonconvex recovery program is NP-hard in general, it does serve as a baseline for evaluating tractable (convex) approaches. For the conventional convex surrogate, the sum-of-nuclear-norms (SNN) model (3.2), we prove a necessary condition that $\Omega(rn^{K-1})$ Gaussian measurements are required for reliable recovery. This lower bound is derived from our study of multi-structured object recovery under a very general setting, which can be applied to many scenarios. To narrow the apparent gap between the nonconvex model and the SNN model, we unfold the tensor into a more balanced matrix while preserving its low-rank property, leading to our square-norm model (4.4). We prove that $O(r^{\lfloor K/2 \rfloor} n^{\lceil K/2 \rceil})$ measurements are sufficient to recover a tensor $\mathcal{X}_0 \in \mathfrak{T}_r$ with high probability. Though the theoretical results only pertain to Gaussian measurements, our simulation results for tensor completion also suggest that the square-norm model outperforms the SNN model.


Model          Sample complexity
non-convex     $(2r)^K + 2nrK + 1$
SNN            $\Omega(rn^{K-1})$
square-norm    $O(r^{\lfloor K/2 \rfloor} n^{\lceil K/2 \rceil})$

Table 2: Summary of results derived in our paper.


Compared with the $\Omega(rn^{K-1})$ measurements required by the sum-of-nuclear-norms model, the sample complexity $O(r^{\lfloor K/2 \rfloor} n^{\lceil K/2 \rceil})$ required by the square reshaping (4.4) is always within a constant of it, and is much better for small $r$ and $K \ge 4$. Although this is a significant improvement, the sample complexity achieved by the square norm model is still suboptimal compared with the nonconvex model (2.1). It remains an open problem to obtain near-optimal convex relaxations for all $K > 2$.

More broadly speaking, to recover objects with multiple structures, regularizing with a combination of individual structure-inducing norms is proven to be substantially suboptimal (Theorem 5 and also [OJF+12]). The resulting sample requirements tend to be much larger than the intrinsic degrees of freedom of the low-dimensional manifold in which the structured signal lies. Our square-norm model for low-rank tensor recovery demonstrates the possibility that a better exploitation of those structures can significantly reduce this sample complexity. However, there are still no clear clues on how to intelligently utilize several simultaneous structures in general, or on how to design tractable methods to recover multi-structured objects with a near-minimal number of measurements. These problems are definitely worth pursuing in future study, and we hope that our work may also inspire researchers working on many other multi-structured recovery problems.
A Proofs for Section 2

Proof of Lemma 1

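Proof. The arguments we use below are primarily adapted from [ENP11], where the interest is to establish the number of Gaussian measurements required to recover a low-rank matrix by rank minimization.

Notice that for every $\mathcal{D} \in \mathfrak{S}_{2r}$ and every $i$, $\langle \mathcal{G}_i, \mathcal{D} \rangle$ is a standard Gaussian random variable, and so

$$\forall\, t \ge 0, \quad \mathbb{P}\big[\, |\langle \mathcal{G}_i, \mathcal{D} \rangle| < t \,\big] \le 2t \cdot \frac{1}{\sqrt{2\pi}} = t \sqrt{\frac{2}{\pi}}. \tag{A.1}$$

Let $\mathfrak{N}$ be an $\varepsilon$-net for $\mathfrak{S}_{2r}$ in terms of $\|\cdot\|_F$. Because the measurements are independent, for any fixed $\mathcal{D} \in \mathfrak{S}_{2r}$,

$$\mathbb{P}\big[\, \|\mathcal{G}[\mathcal{D}]\|_\infty < t \,\big] \le \big( t \sqrt{2/\pi} \big)^m. \tag{A.2}$$

Moreover, for any $\mathcal{D} \in \mathfrak{S}_{2r}$, we have

$$\|\mathcal{G}[\mathcal{D}]\|_\infty \ge \max_{\mathcal{D}' \in \mathfrak{N}} \big\{ \|\mathcal{G}[\mathcal{D}']\|_\infty - \|\mathcal{G}\|_{F \to \infty} \|\mathcal{D} - \mathcal{D}'\|_F \big\} \tag{A.3}$$

$$\ge \min_{\mathcal{D}' \in \mathfrak{N}} \|\mathcal{G}[\mathcal{D}']\|_\infty - \varepsilon \|\mathcal{G}\|_{F \to \infty}. \tag{A.4}$$

Hence,

$$\begin{aligned}
\mathbb{P}\Big[ \inf_{\mathcal{D} \in \mathfrak{S}_{2r}} \|\mathcal{G}[\mathcal{D}]\|_\infty < \varepsilon \log(1/\varepsilon) \Big]
&\le \mathbb{P}\Big[ \min_{\mathcal{D}' \in \mathfrak{N}} \|\mathcal{G}[\mathcal{D}']\|_\infty < 2\varepsilon \log(1/\varepsilon) \Big] + \mathbb{P}\big[ \|\mathcal{G}\|_{F \to \infty} > \log(1/\varepsilon) \big] \\
&\le \#\mathfrak{N} \times \big( 2\sqrt{2/\pi} \times \varepsilon \log(1/\varepsilon) \big)^m + \mathbb{P}\big[ \|\mathcal{G}\|_{F \to \infty} > \log(1/\varepsilon) \big] \\
&\le \beta^d \big( 2\sqrt{2/\pi} \big)^m \varepsilon^{m-d} \log(1/\varepsilon)^m + \mathbb{P}\big[ \|\mathcal{G}\|_{F \to \infty} > \log(1/\varepsilon) \big].
\end{aligned} \tag{A.5}$$

Since $m \ge d + 1$, (A.5) goes to zero as $\varepsilon \searrow 0$. Hence, taking a sequence of decreasing $\varepsilon$, we can show that $\mathbb{P}\big[ \inf_{\mathcal{D} \in \mathfrak{S}_{2r}} \|\mathcal{G}[\mathcal{D}]\|_\infty = 0 \big] \le t$ for every positive $t$, establishing the result. □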

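Proof of Lemma 2

Proof. This follows from the basic fact that for any tensor $\mathcal{X}$ and matrix $U$ of compatible size,

$$\|\mathcal{X} \times_k U\|_F \le \|U\| \, \|\mathcal{X}\|_F, \tag{A.6}$$

which can be established by direct calculation. Write

$$\begin{aligned}
&\big\| [\![\mathcal{C}; U_1, \ldots, U_K]\!] - [\![\mathcal{C}'; U'_1, \ldots, U'_K]\!] \big\|_F \\
&\quad \le \big\| [\![\mathcal{C}; U_1, \ldots, U_K]\!] - [\![\mathcal{C}'; U_1, \ldots, U_K]\!] \big\|_F \\
&\qquad + \sum_{i=1}^{K} \big\| [\![\mathcal{C}'; U'_1, \ldots, U'_i, U_{i+1}, \ldots, U_K]\!] - [\![\mathcal{C}'; U'_1, \ldots, U'_{i-1}, U_i, \ldots, U_K]\!] \big\|_F \\
&\quad \le \|\mathcal{C} - \mathcal{C}'\|_F + \sum_{i=1}^{K} \|U_i - U'_i\|,
\end{aligned}$$

where the first inequality follows from the triangle inequality and the second inequality follows from the fact that $\|\mathcal{C}\|_F = 1$, $\|U_j\| = 1$, $U_i^* U_i = I$ and $U_i'^* U'_i = I$. □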
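Both (A.6) and the resulting bound are easy to check numerically on random instances. The sketch below is only a sanity check under assumed small sizes; `mode_product` and `tucker` are hypothetical helper names, and unit-norm cores and orthonormal factors are generated to match the hypotheses used in the proof.

```python
import numpy as np

def mode_product(X, M, k):
    """Mode-k product X x_k M (axis k of X is multiplied by the matrix M)."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, k)), 0, k)

def tucker(core, factors):
    """Dense tensor [[core; U_1, ..., U_K]]."""
    X = core
    for k, U in enumerate(factors):
        X = mode_product(X, U, k)
    return X

rng = np.random.default_rng(1)
n, r, K = 5, 2, 3

def unit_core():
    C = rng.standard_normal((r,) * K)
    return C / np.linalg.norm(C)

def orthonormal():
    return np.linalg.qr(rng.standard_normal((n, r)))[0]

C, C2 = unit_core(), unit_core()
U, U2 = [orthonormal() for _ in range(K)], [orthonormal() for _ in range(K)]

# (A.6): Frobenius norm of a mode product vs. operator norm times Frobenius norm.
M = rng.standard_normal((7, r))
a6_lhs = np.linalg.norm(mode_product(C, M, 1))
a6_rhs = np.linalg.norm(M, 2) * np.linalg.norm(C)

# Lemma 2: distance between two Tucker tensors vs. distances between their parts.
lhs = np.linalg.norm(tucker(C, U) - tucker(C2, U2))
rhs = np.linalg.norm(C - C2) + sum(np.linalg.norm(a - b, 2) for a, b in zip(U, U2))
print(a6_lhs <= a6_rhs, lhs <= rhs)
```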


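Proof of Lemma 3

Proof. The idea of this proof is to construct a net for each component of the Tucker decomposition and then combine those nets to form a compound net with the desired cardinality.

Denote $\mathfrak{C} = \{\mathcal{C} \in \mathbb{R}^{2r \times 2r \times \cdots \times 2r} \mid \|\mathcal{C}\|_F = 1\}$ and $\mathfrak{O} = \{U \in \mathbb{R}^{n \times 2r} \mid U^* U = I\}$. Clearly, for any $\mathcal{C} \in \mathfrak{C}$, $\|\mathcal{C}\|_F = 1$, and for any $U \in \mathfrak{O}$, $\|U\| = 1$. Thus by [Ver07, Prop. 4] and [Ver10, Lemma 5.2], there exists an $\frac{\varepsilon}{K+1}$-net $\mathfrak{C}'$ covering $\mathfrak{C}$ with respect to the Frobenius norm such that $\#\mathfrak{C}' \le \big( \frac{3(K+1)}{\varepsilon} \big)^{(2r)^K}$, and there exists an $\frac{\varepsilon}{K+1}$-net $\mathfrak{O}'$ covering $\mathfrak{O}$ with respect to the operator norm such that $\#\mathfrak{O}' \le \big( \frac{3(K+1)}{\varepsilon} \big)^{2nr}$. Construct

$$\mathfrak{S}'_{2r} = \big\{ [\![\mathcal{C}'; U'_1, \ldots, U'_K]\!] \ \big|\ \mathcal{C}' \in \mathfrak{C}',\ U'_i \in \mathfrak{O}' \big\}.$$

Clearly $\#\mathfrak{S}'_{2r} \le \big( \frac{3(K+1)}{\varepsilon} \big)^{(2r)^K + 2nrK}$. The rest is to show that $\mathfrak{S}'_{2r}$ is indeed an $\varepsilon$-net covering $\mathfrak{S}_{2r}$ with respect to the Frobenius norm.

For any fixed $\mathcal{D} = [\![\mathcal{C}; U_1, \ldots, U_K]\!] \in \mathfrak{S}_{2r}$ where $\mathcal{C} \in \mathfrak{C}$ and $U_i \in \mathfrak{O}$, by our constructions above, there exist $\mathcal{C}' \in \mathfrak{C}'$ and $U'_i \in \mathfrak{O}'$ such that $\|\mathcal{C} - \mathcal{C}'\|_F \le \frac{\varepsilon}{K+1}$ and $\|U_i - U'_i\| \le \frac{\varepsilon}{K+1}$. Then $\mathcal{D}' = [\![\mathcal{C}'; U'_1, \ldots, U'_K]\!] \in \mathfrak{S}'_{2r}$ is within $\varepsilon$-distance from $\mathcal{D}$, since by the triangle inequality derived in Lemma 2, we have

$$\|\mathcal{D} - \mathcal{D}'\|_F = \big\| [\![\mathcal{C}; U_1, \ldots, U_K]\!] - [\![\mathcal{C}'; U'_1, \ldots, U'_K]\!] \big\|_F \le \|\mathcal{C} - \mathcal{C}'\|_F + \sum_{i=1}^{K} \|U_i - U'_i\| \le \varepsilon.$$

This completes the proof. □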


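B Proofs for Section 3

Proof of Corollary 4

Proof. Denote $\lambda = \delta(C) - m$. Then following [ALMT13, Thm. 7.1], we have

$$\begin{aligned}
\mathbb{P}\big[ C \cap \operatorname{null}(\mathcal{G}) = \{0\} \big]
&\le 4 \exp\left( - \frac{\lambda^2 / 8}{\min\{\delta(C), \delta(C^\circ)\} + \lambda} \right) \\
&\le 4 \exp\left( - \frac{\lambda^2 / 8}{\delta(C) + \lambda} \right) \\
&\le 4 \exp\left( - \frac{(\delta(C) - m)^2}{16\, \delta(C)} \right).
\end{aligned}$$

□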
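Proof of Lemma 5

Proof. Denote $\operatorname{circ}(e_n, \theta)$ as $\operatorname{circ}_n(\theta)$, where $e_n$ is the $n$-th standard basis vector of $\mathbb{R}^n$. Since $\delta(\operatorname{circ}(x_0, \theta)) = \delta(\operatorname{circ}(e_n, \theta))$, it is sufficient to prove $\delta(\operatorname{circ}_n(\theta)) \le n \sin^2\theta + 2$.

Let us first consider the case where $n$ is even. Define a discrete random variable $V$ supported on $\{0, 1, 2, \cdots, n\}$ with probability mass function $\mathbb{P}[V = k] = v_k$. Here $v_k$ denotes the $k$-th intrinsic volume of $\operatorname{circ}_n(\theta)$. As specified in [Ame11, Ex. 4.4.8], we have

$$v_k = \frac{1}{2} \binom{\frac{n-2}{2}}{\frac{k-1}{2}} \sin^{k-1}(\theta) \cos^{n-k-1}(\theta) \quad \text{for } k = 1, 2, \cdots, n-1.$$

From [ALMT13, Prop. 5.11], we know that

$$\delta(\operatorname{circ}_n(\theta)) = \mathbb{E}[V] = \sum_{k=1}^{n} \mathbb{P}[V \ge k].$$

Moreover, by the interlacing result from [ALMT13, Prop. 5.6] and the fact that $\mathbb{P}[V \ge 2k] = \mathbb{P}[V \ge 2k-1] - \mathbb{P}[V = 2k-1]$, we have

$$\begin{aligned}
\mathbb{P}[V \ge 1] &\le 2\,\mathbb{P}[V = 1] + 2\,\mathbb{P}[V = 3] + \cdots + 2\,\mathbb{P}[V = n-1], \\
\mathbb{P}[V \ge 2] &\le \mathbb{P}[V = 1] + 2\,\mathbb{P}[V = 3] + \cdots + 2\,\mathbb{P}[V = n-1]; \\
\mathbb{P}[V \ge 3] &\le 2\,\mathbb{P}[V = 3] + 2\,\mathbb{P}[V = 5] + \cdots + 2\,\mathbb{P}[V = n-1], \\
\mathbb{P}[V \ge 4] &\le \mathbb{P}[V = 3] + 2\,\mathbb{P}[V = 5] + \cdots + 2\,\mathbb{P}[V = n-1]; \\
&\ \ \vdots \\
\mathbb{P}[V \ge n-1] &\le 2\,\mathbb{P}[V = n-1], \\
\mathbb{P}[V \ge n] &\le \mathbb{P}[V = n-1].
\end{aligned}$$

Summing up the above inequalities, we have

$$\begin{aligned}
\mathbb{E}[V] = \sum_{k=1}^{n} \mathbb{P}[V \ge k]
&\le \sum_{k=1,3,\cdots,n-1} 2(k-1)\, v_k + \sum_{k=1,3,\cdots,n-1} 3\, v_k \\
&\le (n-2) \sin^2\theta + \frac{3}{2} \sum_{k=0}^{n} v_k \\
&\le (n-2) \sin^2\theta + \frac{3}{2} = n \sin^2\theta + 2\cos^2\theta - \frac{1}{2},
\end{aligned}$$

where the second last inequality follows from the observations that $\sum_{k=1,3,\cdots,n-1} \frac{k-1}{2} (2 v_k) = \mathbb{E}\big[\operatorname{Bin}\big(\frac{n-2}{2}, \sin^2\theta\big)\big]$ and $\sum_{k=0}^{n} v_k \ge \sum_{k=1,3,\cdots,n-1} 2 v_k$, again by the interlacing result [ALMT13, Prop. 5.6].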
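The two observations used in the last step can be checked directly from the closed form of $v_k$: for even $n$, the values $2 v_k$ over odd $k$ form the probability mass function of a binomial distribution with $\frac{n-2}{2}$ trials and success probability $\sin^2\theta$. The sketch below is a numerical sanity check with arbitrarily chosen $n$ and $\theta$; the function name is hypothetical.

```python
from math import comb, sin, cos

def odd_intrinsic_volumes(n, theta):
    """v_k of circ_n(theta) for odd k = 1, 3, ..., n-1 (n even), from the closed form."""
    N = (n - 2) // 2
    return {
        k: 0.5 * comb(N, (k - 1) // 2) * sin(theta) ** (k - 1) * cos(theta) ** (n - k - 1)
        for k in range(1, n, 2)
    }

n, theta = 12, 0.7
v = odd_intrinsic_volumes(n, theta)

mass = sum(2 * vk for vk in v.values())                  # total mass of 2*v_k over odd k
mean = sum((k - 1) / 2 * 2 * vk for k, vk in v.items())  # binomial mean
bound = sum((2 * (k - 1) + 3) * vk for k, vk in v.items())

print(mass, mean, (n - 2) / 2 * sin(theta) ** 2)
print(bound, n * sin(theta) ** 2 + 2)
```

With these identities, the right hand side of the summed inequalities equals $(n-2)\sin^2\theta + \frac{3}{2}$, which is below $n\sin^2\theta + 2$.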

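Suppose $n$ is odd. Since the intersection of $\operatorname{circ}_{n+1}(\theta)$ with any $n$-dimensional linear subspace containing $e_{n+1}$ is an isometric image of $\operatorname{circ}_n(\theta)$, by [ALMT13, Prop. 4.1], we have

$$\delta(\operatorname{circ}_n(\theta)) = \delta(\operatorname{circ}_n(\theta) \times \{0\}) \le \delta(\operatorname{circ}_{n+1}(\theta)) \le (n+1) \sin^2\theta + 2\cos^2\theta - \frac{1}{2} \le n \sin^2\theta + \cos^2\theta + \frac{1}{2}.$$

Thus, taking both cases ($n$ is even and $n$ is odd) into consideration, we have

$$\delta(\operatorname{circ}_n(\theta)) \le n \sin^2\theta + \cos^2\theta + \frac{1}{2} \le n \sin^2\theta + 2.$$

□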


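Proof of Theorem 5

Proof. Notice that for any fixed $m > 0$, the function $f : t \mapsto 4 \exp\big( -\frac{(t - m)^2}{16 t} \big)$ is decreasing for $t \ge m$. Then due to Corollary 4 and the fact that $\delta(C) \ge \kappa - 2 \ge m$, we have

$$\begin{aligned}
\mathbb{P}\big[ x_0 \text{ is the unique optimal solution to (3.3)} \big]
&= \mathbb{P}\big[ C \cap \operatorname{null}(\mathcal{G}) = \{0\} \big] \\
&\le 4 \exp\left( - \frac{(\delta(C) - m)^2}{16\, \delta(C)} \right) \\
&\le 4 \exp\left( - \frac{(\kappa - m - 2)^2}{16\, (\kappa - 2)} \right).
\end{aligned}$$

□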


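C Proofs for Section 4

Proof of Lemma 6

Proof. (1) By the definition of $\mathcal{X}_{[j]}$, it is sufficient to prove that the vectorization of the right hand side of (4.2) equals $\operatorname{vec}(\mathcal{X}_{(1)})$.

Since $\mathcal{X} = \sum_{i=1}^{r} \lambda_i\, a_i^{(1)} \circ a_i^{(2)} \circ \cdots \circ a_i^{(K)}$, we have

$$\begin{aligned}
\operatorname{vec}(\mathcal{X}_{(1)})
&= \operatorname{vec}\Big( \sum_{i=1}^{r} \lambda_i\, a_i^{(1)} \circ \big( a_i^{(K)} \otimes a_i^{(K-1)} \otimes \cdots \otimes a_i^{(2)} \big) \Big) \\
&= \sum_{i=1}^{r} \lambda_i \operatorname{vec}\Big( a_i^{(1)} \circ \big( a_i^{(K)} \otimes a_i^{(K-1)} \otimes \cdots \otimes a_i^{(2)} \big) \Big) \\
&= \sum_{i=1}^{r} \lambda_i \big( a_i^{(K)} \otimes a_i^{(K-1)} \otimes \cdots \otimes a_i^{(2)} \otimes a_i^{(1)} \big),
\end{aligned}$$

where the last equality follows from the fact that $\operatorname{vec}(a \circ b) = b \otimes a$. Similarly, we can derive that the vectorization of the right hand side of (4.2) is

$$\begin{aligned}
&\operatorname{vec}\Big( \sum_{i=1}^{r} \lambda_i \big( a_i^{(j)} \otimes a_i^{(j-1)} \otimes \cdots \otimes a_i^{(1)} \big) \circ \big( a_i^{(K)} \otimes a_i^{(K-1)} \otimes \cdots \otimes a_i^{(j+1)} \big) \Big) \\
&\quad = \sum_{i=1}^{r} \lambda_i \operatorname{vec}\Big( \big( a_i^{(j)} \otimes a_i^{(j-1)} \otimes \cdots \otimes a_i^{(1)} \big) \circ \big( a_i^{(K)} \otimes a_i^{(K-1)} \otimes \cdots \otimes a_i^{(j+1)} \big) \Big) \\
&\quad = \sum_{i=1}^{r} \lambda_i \big( a_i^{(K)} \otimes a_i^{(K-1)} \otimes \cdots \otimes a_i^{(1)} \big) \\
&\quad = \operatorname{vec}(\mathcal{X}_{(1)}).
\end{aligned}$$

Thus, equation (4.2) is valid.

(2) The above argument can be easily adapted to prove the second claim. Since $\mathcal{X} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 \cdots \times_K U_K$, we have

$$\begin{aligned}
\operatorname{vec}(\mathcal{X}_{(1)})
&= \operatorname{vec}\big( U_1\, \mathcal{C}_{(1)}\, (U_K \otimes U_{K-1} \otimes \cdots \otimes U_2)^* \big) \\
&= (U_K \otimes U_{K-1} \otimes \cdots \otimes U_1) \operatorname{vec}(\mathcal{C}_{(1)}),
\end{aligned}$$

where the last equality follows from the fact that $\operatorname{vec}(ABC) = (C^* \otimes A) \operatorname{vec}(B)$. Similarly, we can derive that the vectorization of the right hand side of (4.3) is

$$\begin{aligned}
&\operatorname{vec}\big( (U_j \otimes U_{j-1} \otimes \cdots \otimes U_1)\, \mathcal{C}_{[j]}\, (U_K \otimes U_{K-1} \otimes \cdots \otimes U_{j+1})^* \big) \\
&\quad = (U_K \otimes U_{K-1} \otimes \cdots \otimes U_1) \operatorname{vec}(\mathcal{C}_{[j]}) \\
&\quad = (U_K \otimes U_{K-1} \otimes \cdots \otimes U_1) \operatorname{vec}(\mathcal{C}_{(1)}) \\
&\quad = \operatorname{vec}(\mathcal{X}_{(1)}).
\end{aligned}$$

Thus, equation (4.3) is valid. □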
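The two vectorization identities that drive this proof can be confirmed numerically for real matrices, where the adjoint is the transpose. The sketch assumes column-major vectorization (columns stacked on top of each other); the variable names and sizes are arbitrary.

```python
import numpy as np

def vec(M):
    """Column-major vectorization: stack the columns of M."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)

# vec(a o b) = b kron a, where a o b is the outer product a b^T.
a, b = rng.standard_normal(3), rng.standard_normal(4)
outer_ok = np.allclose(vec(np.outer(a, b)), np.kron(b, a))

# vec(A B C) = (C^T kron A) vec(B).
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))
abc_ok = np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

print(outer_ok, abc_ok)
```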

D Algorithms for Section 5

In this section, we discuss in detail our implementation of the accelerated linearized Bregman algorithm for the following problem:

$$\text{minimize}_{\mathcal{X}} \quad \sum_{i=1}^{K} \|\mathcal{X}_{(i)}\|_* \quad \text{subject to} \quad \mathcal{P}_\Omega[\mathcal{X}] = \mathcal{P}_\Omega[\mathcal{X}_0]. \tag{D.1}$$

By introducing an auxiliary variable $\mathcal{W}$ and splitting $\mathcal{X}$ into $\mathcal{X}_1, \mathcal{X}_2, \cdots, \mathcal{X}_K$, it can be easily verified that problem (D.1) is equivalent to

$$\min_{\{\mathcal{X}_i\},\, \mathcal{W}} \ \sum_{i=1}^{K} \|(\mathcal{X}_i)_{(i)}\|_* \quad \text{s.t.} \quad \mathcal{X}_i = \mathcal{W},\ i = 1, 2, \cdots, K, \quad \mathcal{P}_\Omega[\mathcal{W}] = \mathcal{P}_\Omega[\mathcal{X}_0], \tag{D.2}$$

whose objective function is now separable.

The accelerated linearized Bregman (ALB) algorithm, proposed in [HMG13], is an efficient first-order method designed for solving convex optimization problems with nonsmooth objective functions and linear constraints. It has been successfully applied to solve $\ell_1$ and nuclear norm minimization problems [HMG13]. The ALB algorithm solves a nonsmooth problem by first smoothing the objective function (e.g. adding a small $\ell_2$ perturbation), and then applying Nesterov's accelerated scheme [Nes83] to the dual problem, which can be verified to be unconstrained and Lipschitz differentiable. In Algorithm 1, we describe our ALB algorithm adapted to problem (D.2). Algorithm 1 solves exactly the smoothed version of problem (D.2):

$$\min_{\{\mathcal{X}_i\},\, \mathcal{W}} \ \sum_{i=1}^{K} \Big( \|(\mathcal{X}_i)_{(i)}\|_* + \frac{1}{2\mu} \|\mathcal{X}_i\|_F^2 \Big) + \frac{1}{2\mu} \|\mathcal{W}\|_F^2 \quad \text{s.t.} \quad \mathcal{X}_i = \mathcal{W},\ i = 1, 2, \cdots, K, \quad \mathcal{P}_\Omega[\mathcal{W}] = \mathcal{P}_\Omega[\mathcal{X}_0], \tag{D.3}$$

where we denote $\mathcal{Y}_i$ as the dual variable for the constraint $\mathcal{X}_i = \mathcal{W}$ and $\mathcal{Z}$ as the dual variable for the last constraint $\mathcal{P}_\Omega[\mathcal{W}] = \mathcal{P}_\Omega[\mathcal{X}_0]$. Since the objective function in (D.3) is separable, each step of the ALB algorithm is easy to solve, as we can see from Algorithm 1.
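Algorithm 1: accelerated linearized Bregman algorithm for SNN model (D.1)

1   Initialization: $\mathcal{Y}_i^0 = \tilde{\mathcal{Y}}_i^0 = 0$ for each $i \in [K]$, $\mathcal{Z}^0 = \tilde{\mathcal{Z}}^0 = 0$, $\mu > 0$, $\tau > 0$, $t_0 = 1$;
2   for $k = 0, 1, 2, \cdots$ do
3       for $i = 1, 2, \cdots, K$ do
4           $\mathcal{X}_i^{k+1} = \mu \cdot \mathrm{Shrinkage}(\mathcal{Y}_i^k, 1)$;
5       $\mathcal{W}^{k+1} = \mu \cdot \big( \mathcal{P}_\Omega[\mathcal{Z}^k] - \sum_{i=1}^{K} \mathcal{Y}_i^k \big)$;
6       for $i = 1, 2, \cdots, K$ do
7           $\tilde{\mathcal{Y}}_i^{k+1} = \mathcal{Y}_i^k - \tau \cdot \big( \mathcal{X}_i^{k+1} - \mathcal{W}^{k+1} \big)$;
8       $\tilde{\mathcal{Z}}^{k+1} = \mathcal{Z}^k - \tau \cdot \mathcal{P}_\Omega\big[ \mathcal{W}^{k+1} - \mathcal{X}_0 \big]$;
9       $t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$;
10      for $i = 1, 2, \cdots, K$ do
11          $\mathcal{Y}_i^{k+1} = \tilde{\mathcal{Y}}_i^{k+1} + \frac{t_k - 1}{t_{k+1}} \big( \tilde{\mathcal{Y}}_i^{k+1} - \tilde{\mathcal{Y}}_i^{k} \big)$;
12      $\mathcal{Z}^{k+1} = \tilde{\mathcal{Z}}^{k+1} + \frac{t_k - 1}{t_{k+1}} \big( \tilde{\mathcal{Z}}^{k+1} - \tilde{\mathcal{Z}}^{k} \big)$;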


For our numerical experiment ($K = 4$), we choose smoothing parameter $\mu = 50 \|\mathcal{X}_0\|_F$ and step size $\tau = [\ldots]$. Empirically, we observe that larger values of $\mu$ do not result in better recovery performance. This is consistent with the theoretical results established in [LY12, ZCCZ11].
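For readers who want to experiment, the updates of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reimplementation under assumptions, not the authors' code: the helper names are hypothetical, the sampling operator is represented by a boolean mask, and the step size and iteration count used in the small demo are our own choices rather than values from the paper.

```python
import numpy as np

def unfold(X, i):
    """Mode-i unfolding: axis i indexes the rows."""
    return np.moveaxis(X, i, 0).reshape(X.shape[i], -1)

def fold(M, i, shape):
    """Inverse of unfold for a tensor of the given shape."""
    rest = [s for k, s in enumerate(shape) if k != i]
    return np.moveaxis(M.reshape([shape[i]] + rest), 0, i)

def shrinkage(Y, i, thresh=1.0):
    """Soft-threshold the singular values of the mode-i unfolding, then fold back."""
    U, s, Vt = np.linalg.svd(unfold(Y, i), full_matrices=False)
    return fold((U * np.maximum(s - thresh, 0.0)) @ Vt, i, Y.shape)

def alb_snn(X0, mask, mu, tau, iters=200):
    """Accelerated linearized Bregman iterations for the smoothed problem (D.3).
    Only the entries of X0 selected by `mask` are used. Returns the last W."""
    K = X0.ndim
    Y = [np.zeros_like(X0) for _ in range(K)]
    Y_tilde = [y.copy() for y in Y]
    Z = np.zeros_like(X0)
    Z_tilde = Z.copy()
    W = np.zeros_like(X0)
    t = 1.0
    for _ in range(iters):
        X = [mu * shrinkage(Y[i], i) for i in range(K)]            # line 4
        W = mu * (mask * Z - sum(Y))                               # line 5
        Y_new = [Y[i] - tau * (X[i] - W) for i in range(K)]        # line 7
        Z_new = Z - tau * mask * (W - X0)                          # line 8
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0         # line 9
        beta = (t - 1.0) / t_next
        Y = [Y_new[i] + beta * (Y_new[i] - Y_tilde[i]) for i in range(K)]  # line 11
        Z = Z_new + beta * (Z_new - Z_tilde)                       # line 12
        Y_tilde, Z_tilde, t = Y_new, Z_new, t_next
    return W

# Tiny demo on a random rank-one tensor (sizes and parameters are arbitrary).
rng = np.random.default_rng(0)
a, b, c = (rng.standard_normal(4) for _ in range(3))
X_demo = np.einsum("i,j,k->ijk", a, b, c)
mask_demo = rng.random(X_demo.shape) < 0.7
mu_demo = 50 * np.linalg.norm(X_demo)
W_demo = alb_snn(X_demo, mask_demo, mu_demo, tau=1.0 / (5 * mu_demo), iters=50)
print(W_demo.shape)
```

The `shrinkage` helper acts on the unfolding along the mode passed as `i`, so each dual variable is thresholded along its own mode, as in line 4 of Algorithm 1.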


4 The Shrinkage operator in line 4 of Algorithm 1 performs the regular shrinkage on the singular values of the $i$-th unfolding matrix of $\mathcal{Y}_i^k$, i.e. $(\mathcal{Y}_i^k)_{(i)}$, and then folds the resulting matrix back into a tensor.